#' ---
#' title: "Get data from multiple netcdf files"
#' author: "James McCreight"
#' date: "`r Sys.Date()`"
#' output: rmarkdown::html_vignette
#' vignette: >
#'   %\VignetteIndexEntry{Get data from multiple netcdf files}
#'   %\VignetteEngine{knitr::rmarkdown}
#'   \usepackage[utf8]{inputenc}
#' ---
#' 
#' # Background
#' The output of WRF Hydro model runs is spread over multiple netcdf files and multiple netcdf files in output categories. This tool presents a list-based approach to gathering all the timeseries data you need at once (over both time and output  file types) with parallelization at the file level for enhanced speed. The main limitation is that each variable specified may only return a scalar for a given time. However, statistics can summarize spatial fields or the same variable can be specified multiple times with different index. (There is no efficient way to get bulk spatial data over time.)
#' 
#' # Setup
#' Load the rwrfhydro package. 
## ------------------------------------------------------------------------
library("rwrfhydro")

#' 
## ---- echo=FALSE---------------------------------------------------------
options(width = 190)

#' 
#' Set the WRF Hydro test case directory, this should be the only thing you need to customize to your machine. We'll use daily data from the Fourmile Creek test case where the full routing physics are used. 
## ------------------------------------------------------------------------
dataPath <- '~/wrfHydroTestCases/Fourmile_Creek_testcase_v2.0/run.FullRouting/'

#' 
#' # List-based data retrieval
#' The `GetMultiNcdf` function needs 3 *collated* lists as input: `filesList`, `variableList`, and `indexList` (e.g. `args(GetMultiNcdf)`). (Note these are the formal argument names which may be shortend when calling the function, by R rules, upto uniqueness of the argument name. We construct variables in the global workspace to pass to these functions which have similar but not identical names.)
#' 
#' ## `filesList`
#' This is a named list of arbitrary length. The individual entries corrospond to different file types of mode output and contain all the files you want to look at. These are usually generated by `files.list()` in a directory. Here output files from the land surface model and RESTART files from the hydro model are desired. These are found individually and then placed into `fileList`
## ------------------------------------------------------------------------
lsmFiles <- list.files(path=dataPath, pattern='LDASOUT_DOMAIN', full.names=TRUE)
hydroFiles <- list.files(path=dataPath, pattern='HYDRO_RST', full.names=TRUE)
flList <- list(lsm=lsmFiles, hydro=hydroFiles)

#' Note that the output and restart frequencies are different from the number of files of each type.
#' 
#' ## `variableList`
#' All three lists are collated by name. Now we define which variables are desired for each file group. From the land surface model (LSM) output files, we want the surface radiative temperature (TRAD) and the snow water equivalent (SNEQV). For the  hydro restart files, we want streamflow and soil moisture on the four vertical soil layers. The layers are differentiated, for now, only by the names we give them (smc1-4). 
## ------------------------------------------------------------------------
lsmVars   <- list(TRAD='TRAD', SWE='SNEQV')
hydroVars <- list(streamflow='qlink1', smc1='sh2ox1', smc2='sh2ox2', smc3='sh2ox3', smc4='sh2ox4')
varList <- list(lsm=lsmVars, hydro=hydroVars)

#' 
#' ## `indexList`
#' The indexList defines what indices/stats are desired for each variable in each file group. This list is collated with both of the previous two lists in a nested way, illustrated below. 
#' 
#' Only a scalar can be returned for each entry specified index. However, spatial fields (a range of indices) can be summarized using arbitrary statistics. We show how to define your own useful statistics which can be used when specifying the indexList. (Note that the envir argument may be needed to get your function inside of GetMultiNcdf in special circumstances.)
#' 
#' Our statistic example is to calculate basin-average radiative temperature, basin-maximum snow water equivalent, and basin-average soil moisture on each layer. Since all of these variables are on the low-res grid, we need the basin mask from the high-res grid resampled to the low-res geogrid. We use the CreateBasinMask function to generate a basin mask weight grid (each cell value is the fraction of basin within that cell). We specify the path to the high-res routing grid (which contains the basin mask variable), the basin ID we want to run (1), and the aggregation factor between the high-res and low-res grids (10).
## ------------------------------------------------------------------------
library(ncdf4)
basinMask <- CreateBasinMask(paste0(dataPath,'DOMAIN/Fulldom_hires_hydrofile.Fourmile100m.nc'), 
                             basid=1, aggfact=10)

#' 
#' Then, we use this resampled basin mask to setup basin mean and max functions, which determine the mean and maximum values of a variable within the basin mask.
## ------------------------------------------------------------------------
basAvg <- function(var) sum(basinMask*var)/sum(basinMask)
basMax <- function(var) max(ceiling(basinMask)*var)

#' 
#' For fun, we can also calculate the size of the basin while we are here. The LSM pixels are 1km:
## ------------------------------------------------------------------------
basinKm2 <- sum(basinMask)
basinKm2

#' 
#' Now we can construct the indexList. For the LSM, we want spatial summaries of TRAD and SNEQV at each time. Statistical summaries are requested using list for each variable. The list specifies a grid 'slice' (as distinct from a subset, i.e. a slice is how ncdf4 lets one subset matrices) on which to compute a *named* statistic. The required names in this list are `start`, `end`, and `stat` in the list. 
#' Statistic lists are also specified for the individual soil moisture layers, 1-4, in the hydro restart files. *Note that the dimensions are reverse order from what is shown in "ncdump -h".* For the discharge, qlink1, we dont supply a list. Instead we only give the integer index where flow is desired. (See the "WRF Hydro Domain and Channel Visualization" vignette to understand why index 1 on the stream channel just happens, conincidentally, to be the basin outlet.)
## ------------------------------------------------------------------------
lsmInds   <- list(TRAD=list(start=c(1,1,1), end=c(21,7,1), stat='basAvg'),
                  SNEQV=list(start=c(1,1,1), end=c(21,7,1), stat='basMax'))
hydroInds <- list(qlink1=1,
                  smc1=list(start=c(1,1), end=c(21,7), stat='basAvg'),
                  smc2=list(start=c(1,1), end=c(21,7), stat='basAvg'),
                  smc3=list(start=c(1,1), end=c(21,7), stat='basAvg'),
                  smc4=list(start=c(1,1), end=c(21,7), stat='basAvg') )
indList <- list(lsm=lsmInds, hydro=hydroInds)

#' 
#' ## `GetMultiNcdf()`
#' All the work is really in setting up the lists. Now we just pass these to `GetMultiNcdf`. The preceeding two lines setup optional parallelization of the data grabs. 
## ------------------------------------------------------------------------
library(doMC)   ## Showing parallelization, which is at the file level within
registerDoMC(3) ## each file groups; pointless to be longer than your timeseries.
fileData <- GetMultiNcdf(filesList=flList, variableList=varList, indexList=indList, parallel=FALSE)

#' 
#' What did we get?
## ------------------------------------------------------------------------
str(fileData)

#' 
#' The `fileData` dataframe shows the time (`POSIXct`) at which certain indices (`inds`) were summarized with statistic `stat` for each `variable` (variable names in the file, e.g. sh2ox) The resulting `value` is given with the `variableGroup` (e.g. smc1-4 and not sh2ox) and `fileGroup`. 
#' 
#' # Plot the timeseries
#' This output format is easily plotted using `ggplot2`. 
## ----results='hold', fig.width = 12, fig.height = 10.29*1.2, out.width='700', out.height='720'----
library(ggplot2)
library(scales)
ggplot(fileData, aes(x=POSIXct, y=value, color=fileGroup)) +
  geom_line() + geom_point() +
  facet_wrap(~variableGroup, scales='free_y', ncol=1) +
  scale_x_datetime(breaks = date_breaks("1 month")) + theme_bw()

